Contents: NLP Project - 2

  1. Part-A: Solution
  2. Part-B: Solution

Part-A: Solution

  • DOMAIN: Digital content and entertainment industry
  • CONTEXT: The objective of this project is to build a text classification model that analyses the customer's sentiments based on their reviews in the IMDB database. The model uses a complex deep learning model to buildan embedding layer followed by a classification algorithm to analyse the sentiment of the customers.
  • Data Description: The Dataset of 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the for that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocabulary size of 10,000. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
  • PROJECT OBJECTIVE: To Build a sequential NLP classifier which can use input text parameters to determine the customer sentiments.
In [ ]:
# Import all the relevant libraries needed to complete the analysis, visualization, modeling and presentation
import pandas as pd
import numpy as np
import os

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')

from scipy import stats
from scipy.stats import zscore

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 
from sklearn.metrics import ConfusionMatrixDisplay, precision_score, recall_score 
from sklearn.metrics import precision_recall_curve, roc_curve, auc, roc_auc_score
from sklearn.metrics import plot_precision_recall_curve, average_precision_score
from sklearn.metrics import f1_score, plot_roc_curve 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC

# from sklearn.decomposition import PCA
# from scipy.cluster.hierarchy import dendrogram, linkage
# from scipy.cluster.hierarchy import fcluster
# from sklearn.cluster import KMeans 
# from sklearn.metrics import silhouette_samples, silhouette_score

# import xgboost as xgb
# from xgboost import plot_importance
# from lightgbm import LGBMClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTENC, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler

import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer
import pandas_profiling as pp

import gensim
import logging

# import cv2
# from google.colab.patches import cv2_imshow
# from glob import glob
# import itertools

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.layers import Activation, GlobalMaxPool2D, GlobalAveragePooling2D
from tensorflow.keras.layers import UpSampling2D, Input, Concatenate
from tensorflow.keras.layers import BatchNormalization, LeakyReLU
from tensorflow.keras.optimizers import Adam, RMSprop, SGD, Adagrad

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.metrics import Recall, Precision
from tensorflow.keras import backend as K

from tensorflow import keras
from keras.utils.np_utils import to_categorical  
from keras.utils import np_utils
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor

import warnings
warnings.filterwarnings("ignore")

import random
from zipfile import ZipFile

# Set random_state
random_state = 42

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# Current working directory
%cd "/content/drive/MyDrive/MGL/Project-NLP-2/"

# # List all the files in a directory
# for dirname, _, filenames in os.walk('path'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))
/content/drive/MyDrive/MGL/Project-NLP-2
In [ ]:
# List files in the directory
!ls
 1.ipynb	    'IMDB Dataset.csv.zip'	   'NLP FAQ Sequential-1.pdf'
 2.ipynb	    'Milestone-NLP 2.pdf'	   'REVALUATION POLICY-7.pdf'
 glove.6B.zip	     model.png
'IMDB Dataset.csv'  'NLP-2_Problem Statement.pdf'

1. Import and analyse the data set.

  • Use imdb.load_data() method
  • Get train and test set
  • Take 10000 most frequent words

Quick EDA for complete dataset

In [ ]:
# # Path of the data file
# path = 'IMDB Dataset.csv.zip'

# # Unzip files in the current directory

# with ZipFile (path,'r') as z:
#   z.extractall() 
# print("Training zip extraction done!")
In [ ]:
# Import the dataset
df = pd.read_csv('IMDB Dataset.csv')
In [ ]:
df.shape
Out[ ]:
(50000, 2)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
In [ ]:
# pd.set_option('display.max_colwidth', None)
df.head()
Out[ ]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [ ]:
# Clear the matplotlib plotting backend
%matplotlib inline
plt.close('all')
In [ ]:
# Understand the 'sentiment' the target vector
f,axes=plt.subplots(1,2,figsize=(17,7))
df['sentiment'].value_counts().plot.pie(autopct='%1.1f%%',ax=axes[0])
sns.countplot('sentiment',data=df,ax=axes[1])
axes[0].set_title('Pie Chart for sentiment')
axes[1].set_title('Bar Graph for sentiment')
plt.show()

The dataset consists of two groups:

  • 25000 positive reviews
  • 25000 negative reviews

Its evident that the dataset is very well balanced. This is a very favourable situation for a classification task.

In [ ]:
# Visualize word cloud of random positive and negative review

# Choose randomly a positive review and a negative review
ind_positive = random.choice(list(df[df['sentiment'] == 'positive'].index))
ind_negative = random.choice(list(df[df['sentiment'] == 'negative'].index))

review_positive = df['review'][ind_positive]
review_negative = df['review'][ind_negative]

print('Positive review: ', review_positive)
print('\n')
print('Negative review: ', review_negative)
print('\n')

from wordcloud import WordCloud
cloud_positive = WordCloud().generate(review_positive)
cloud_negative = WordCloud().generate(review_negative)

plt.figure(figsize = (20,15))
plt.subplot(1,2,1)
plt.imshow(cloud_positive)
plt.title('Positive review')

plt.subplot(1,2,2)
plt.imshow(cloud_negative)
plt.title('Negative review')
plt.show()
Positive review:  'War movie' is a Hollywood genre that has been done and redone so many times that clichéd dialogue, rehashed plot and over-the-top action sequences seem unavoidable for any conflict dealing with large-scale combat. Once in a while, however, a war movie comes along that goes against the grain and brings a truly original and compelling story to life on the silver screen. The Civil War-era "Cold Mountain," starring Jude Law, Nicole Kidman and Renée Zellweger is such a film.<br /><br />Then again, calling Cold Mountain" a war movie is not entirely accurate. True enough, the film opens with a (quite literally) quick-and-dirty battle sequence that puts "Glory" director Edward Zwick shame. However, "Cold Mountain" is not so much about the Civil War itself as it is about the period and the people of the times. The story centers around disgruntled Confederate soldier Inman, played by Jude Law, who becomes disgusted with the gruesome war and homesick for the beautiful hamlet of Cold Mountain, North Carolina and the equally beautiful southern belle he left behind, Ada Monroe, played by Nicole Kidman. At first glance, this setup appears formulaic as the romantic interest back home gives the audience enough sympathy to root for the reluctant soldier's tribulations on the battlefield. Indeed, the earlier segments of the film are relatively unimpressive and even somewhat contrived.<br /><br />"Cold Mountain" soon takes a drastic turn, though, as the intrepid hero Inman turns out to be a deserter (incidentally saving the audience from the potentially confusing scenario of wanting to root for the Confederates) and begins a long odyssey homeward. Meanwhile, back at the farm, Ada's cultured ways prove of little use in the fields; soon she is transformed into something of a wilderbeast. Coming to Ada's rescue is the course, tough-as-nails Ruby Thewes, played by Renée Zellweger, who helps Ada put the farm back together and, perhaps more importantly, cope with the loneliness and isolation the war seems to have brought upon Ada.<br /><br />Within these two settings, a vivid, compelling and, at times, very disturbing portrait of the war-torn South unfolds. The characters with whom Inman and Ada interact are surprisingly complex, enhanced by wonderful performances of Brendan Gleeson as Ruby's deadbeat father, Ray Winstone as an unrepentant southern "lawman," and Natalie Portman as a deeply troubled and isolated young mother. All have been greatly affected and changed by "the war of Northern aggression," mostly for the worse. The dark, pervading anti-war message, accented by an effective, haunting score and chillingly beautiful shots of Virginia and North Carolina, is communicated to the audience not so much by gruesome battle scenes as by the scarred land and traumatized people for which the war was fought. Though the weapons and tactics of war itself have changed much in the past century, it's hellish effect on the land is timelessly relevant.<br /><br />Director Anthony Minghella manages to maintain this gloomy mood for most of the film, but the atmosphere is unfortunately denigrated by a rather tepid climax that does little justice to the wonderfully formed characters. The love story between Inman and Ada is awkwardly tacked onto the beginning and end of the film, though the inherently distant, abstracted and even absurd nature of their relationship in a way fits the dismal nature of the rest of the plot.<br /><br />Make no mistake, "Cold Mountain" has neither the traits of a feel-good romance nor an inspiring war drama. It is a unique vision of an era that is sure not only to entertain but also to truly absorb the audience into the lives of a people torn apart by a war and entirely desperate to be rid of its terrible repercussions altogether.


Negative review:  The Twilight Zone has achieved a certain mythology about it--much like Star Trek. That's because there are many devoted lovers of the show that no matter what think every episode was a winner. They are the ones who score each individual show a 10 and cannot objectively evaluate the show. Because of this, a while back I reviewed all the original Star Trek episodes (the good and the bad) because the overall ratings and reviews were just too positive. Now, it's time to do the same for The Twilight Zone.<br /><br />Now I was very surprised when I saw reviews for this bland episode that described it as being "among the best" and gave it scores of 10. If this is the case, then why is it that everyone I know who has seen this episode hates it as much as I do? It's possible that me and my family and friends are all cranks but it's also possible this is yet another case of rabid fans rabidly inflating the rating on an average or below average episode.<br /><br />The episode itself stars William Windom and others as various archetypes--a soldier, a dancer, etc. They are all stuck in a cylindrical room with no escape and only at the end do you realize the "shocking truth"--which isn't at all shocking and is in fact majorly lame. No, this is a badly written and unengaging episode. Yes, there were plenty of episodes of the series that deserved a 10, but few as undeserving as this one due to a shallow script and an unappealing resolution.


In [ ]:
# Text Cleaning
import re

def remove_url(text):
    url_tag = re.compile(r'https://\S+|www\.\S+')
    text = url_tag.sub(r'', text)
    return text

def remove_html(text):
    html_tag = re.compile(r'<.*?>')
    text = html_tag.sub(r'', text)
    return text

def remove_punctuation(text): 
    punct_tag = re.compile(r'[^\w\s]')
    text = punct_tag.sub(r'', text) 
    return text

def remove_special_character(text):
    special_tag = re.compile(r'[^a-zA-Z0-9\s]')
    text = special_tag.sub(r'', text)
    return text

def remove_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text    
    
def clean_text(text):
    text = remove_url(text)
    text = remove_html(text)
    text = remove_punctuation(text)
    text = remove_special_character(text)
    text = remove_emojis(text)
    text = text.lower()
    
    return text
In [ ]:
df['processed'] = df['review'].apply(lambda x: clean_text(x))
df['label'] = df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
df.head()
Out[ ]:
review sentiment processed label
0 One of the other reviewers has mentioned that ... positive one of the other reviewers has mentioned that ... 1
1 A wonderful little production. <br /><br />The... positive a wonderful little production the filming tech... 1
2 I thought this was a wonderful way to spend ti... positive i thought this was a wonderful way to spend ti... 1
3 Basically there's a family where a little boy ... negative basically theres a family where a little boy j... 0
4 Petter Mattei's "Love in the Time of Money" is... positive petter matteis love in the time of money is a ... 1
In [ ]:
# df = df.sample(n=1000, random_state = 0)
In [ ]:
# Create the features matrix and target vector
df1=df[['processed', 'label']]
df1.head()
Out[ ]:
processed label
0 one of the other reviewers has mentioned that ... 1
1 a wonderful little production the filming tech... 1
2 i thought this was a wonderful way to spend ti... 1
3 basically theres a family where a little boy j... 0
4 petter matteis love in the time of money is a ... 1
In [ ]:
# Split the data for training and testing
# To be used in the transformers (BERT)
train, test = train_test_split(df1, test_size=0.5, random_state=0)

Using the imdb.load_data() method

In [ ]:
# Loading the IMDB dataset
# The argument num_words=10000 keeps the top 10,000 most frequently occurring words in the training data. 
# The rare words are discarded to keep the size of the data manageable.

top_words = 10000
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data(path="imdb.npz",
                                                      num_words=top_words)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 1s 0us/step
In [ ]:
X_train
Out[ ]:
array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]),
       list([1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5974, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, 534, 19, 263, 4821, 1301, 4, 1873, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 1716, 43, 645, 662, 8, 257, 85, 1200, 42, 1228, 2578, 83, 68, 3912, 15, 36, 165, 1539, 278, 36, 69, 2, 780, 8, 106, 14, 6905, 1338, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2300, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2307, 51, 9, 170, 23, 595, 116, 595, 1352, 13, 191, 79, 638, 89, 2, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113]),
       ...,
       list([1, 11, 6, 230, 245, 6401, 9, 6, 1225, 446, 2, 45, 2174, 84, 8322, 4007, 21, 4, 912, 84, 2, 325, 725, 134, 2, 1715, 84, 5, 36, 28, 57, 1099, 21, 8, 140, 8, 703, 5, 2, 84, 56, 18, 1644, 14, 9, 31, 7, 4, 9406, 1209, 2295, 2, 1008, 18, 6, 20, 207, 110, 563, 12, 8, 2901, 2, 8, 97, 6, 20, 53, 4767, 74, 4, 460, 364, 1273, 29, 270, 11, 960, 108, 45, 40, 29, 2961, 395, 11, 6, 4065, 500, 7, 2, 89, 364, 70, 29, 140, 4, 64, 4780, 11, 4, 2678, 26, 178, 4, 529, 443, 2, 5, 27, 710, 117, 2, 8123, 165, 47, 84, 37, 131, 818, 14, 595, 10, 10, 61, 1242, 1209, 10, 10, 288, 2260, 1702, 34, 2901, 2, 4, 65, 496, 4, 231, 7, 790, 5, 6, 320, 234, 2766, 234, 1119, 1574, 7, 496, 4, 139, 929, 2901, 2, 7750, 5, 4241, 18, 4, 8497, 2, 250, 11, 1818, 7561, 4, 4217, 5408, 747, 1115, 372, 1890, 1006, 541, 9303, 7, 4, 59, 2, 4, 3586, 2]),
       list([1, 1446, 7079, 69, 72, 3305, 13, 610, 930, 8, 12, 582, 23, 5, 16, 484, 685, 54, 349, 11, 4120, 2959, 45, 58, 1466, 13, 197, 12, 16, 43, 23, 2, 5, 62, 30, 145, 402, 11, 4131, 51, 575, 32, 61, 369, 71, 66, 770, 12, 1054, 75, 100, 2198, 8, 4, 105, 37, 69, 147, 712, 75, 3543, 44, 257, 390, 5, 69, 263, 514, 105, 50, 286, 1814, 23, 4, 123, 13, 161, 40, 5, 421, 4, 116, 16, 897, 13, 2, 40, 319, 5872, 112, 6700, 11, 4803, 121, 25, 70, 3468, 4, 719, 3798, 13, 18, 31, 62, 40, 8, 7200, 4, 2, 7, 14, 123, 5, 942, 25, 8, 721, 12, 145, 5, 202, 12, 160, 580, 202, 12, 6, 52, 58, 2, 92, 401, 728, 12, 39, 14, 251, 8, 15, 251, 5, 2, 12, 38, 84, 80, 124, 12, 9, 23]),
       list([1, 17, 6, 194, 337, 7, 4, 204, 22, 45, 254, 8, 106, 14, 123, 4, 2, 270, 2, 5, 2, 2, 732, 2098, 101, 405, 39, 14, 1034, 4, 1310, 9, 115, 50, 305, 12, 47, 4, 168, 5, 235, 7, 38, 111, 699, 102, 7, 4, 4039, 9245, 9, 24, 6, 78, 1099, 17, 2345, 2, 21, 27, 9685, 6139, 5, 2, 1603, 92, 1183, 4, 1310, 7, 4, 204, 42, 97, 90, 35, 221, 109, 29, 127, 27, 118, 8, 97, 12, 157, 21, 6789, 2, 9, 6, 66, 78, 1099, 4, 631, 1191, 5, 2642, 272, 191, 1070, 6, 7585, 8, 2197, 2, 2, 544, 5, 383, 1271, 848, 1468, 2, 497, 2, 8, 1597, 8778, 2, 21, 60, 27, 239, 9, 43, 8368, 209, 405, 10, 10, 12, 764, 40, 4, 248, 20, 12, 16, 5, 174, 1791, 72, 7, 51, 6, 1739, 22, 4, 204, 131, 9])],
      dtype=object)
In [ ]:
y_train
Out[ ]:
array([1, 0, 0, ..., 0, 1, 0])

2. Perform relevant sequence adding on the data.

3. Perform following data analysis:

  • Print shape of features and labels
  • Print value of any one feature and it's label

4. Decode the feature value to get original sentence

Considering the above 2nd, 3rd, and 4th parts together in below code cells:

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review. Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

In [ ]:
# Shape of training data
print("X_train: {}, y_train: {}".format(len(X_train),len(y_train)))
X_train: 25000, y_train: 25000
In [ ]:
# Shape of test data
print("X_test: {}, y_test: {}".format(len(X_test),len(y_test)))
X_test: 25000, y_test: 25000
In [ ]:
# The text of reviews have been converted to integers, where each integer represents a specific word in a dictionary. 
# Looking at the first review
print(X_train[0])
print(y_train[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
1
In [ ]:
# Movie reviews may be different lengths. The below code shows the number of words in the first and second reviews. 
# Since inputs to a NN/RNN must be the same length, we'll need to resolve this later.
len(X_train[0]), len(X_train[1])
Out[ ]:
(218, 189)
In [ ]:
# Convert integers back to text: Here, we'll create a helper function to query a dictionary object that contains the integer to string mapping:

# A dictionary mapping words to an integer index
imdb = keras.datasets.imdb
word_index = imdb.get_word_index()

# The first indices are reserved

word_index = {k:(v+3) for k,v in word_index.items()}

word_index["<PAD>"] =0
word_index["<START>"]=1
word_index["<UNK>"]=2 #unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i,'?') for i in text])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 [==============================] - 0s 0us/step
In [ ]:
decode_review(X_train[0])
Out[ ]:
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
In [ ]:
decode_review(X_train[1])
Out[ ]:
"<START> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal <UNK> the hair is big lots of boobs <UNK> men wear those cut <UNK> shirts that show off their <UNK> sickening that men actually wore them and the music is just <UNK> trash that plays over and over again in almost every scene there is trashy music boobs and <UNK> taking away bodies and the gym still doesn't close for <UNK> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then"

The reviews (intteger arrays) must be converted to tensors before fed into the neural network. This conversion can be done in many ways:

  • One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [1, 5, 6] would become a 10,000-dimensional vector that is all zeros except for indices at 1, 5 and 6, which are ones. Then, make this the first layer in our network—a Dense layer—that can handle floating point vector data. This approach is memory intensive, though, requiring a num_words * num_reviews size matrix.
  • Another method, we can pad the arrays so that they all have the same length, then create an integer tensor of shape max_length * num_reviews. We can use an embedding layer capable of handling this shape as the first layer in our network. Since the movie reviews must be the same length, we will use the pad_sequences function to standardize the lengths.
In [ ]:
# pad_sequences is used to ensure that all sequences in a list have the same length. By default this is done by padding 0 in the beginning
# of each sequence until each sequence has the same length as the longest sequence.

#Since the sequences have different lengtht, then we use padding method to put all sequences to the same length. 
#The parameter "maxlen" sets the maximum length of the output sequence. 

# If length of the input sequence is larger than "maxlen", then it is trunced to keep only #maxlen words, (truncating = 'pre': keep the 
# previous part of the sequence; truncating = 'post': keep the posterior part of the sequence).

# If length of the input sequence is smaller than "maxlen", then 0 elements will be padded into the previous part of sequence 
# (if padding = 'pre' - by defaut) or in the tail of the sequence (if padding = 'post').

max_length = 256
trunc_type = 'post'

X_train = keras.preprocessing.sequence.pad_sequences(X_train, value=word_index["<PAD>"],padding="post",maxlen = max_length, truncating = trunc_type)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, value=word_index["<PAD>"],padding="post",maxlen = max_length, truncating = trunc_type)
In [ ]:
# Check the length of reviews again
len(X_train[0]), len(X_train[1])
Out[ ]:
(256, 256)
In [ ]:
# Check the first review after padding
X_train[0]
Out[ ]:
array([   1,   14,   22,   16,   43,  530,  973, 1622, 1385,   65,  458,
       4468,   66, 3941,    4,  173,   36,  256,    5,   25,  100,   43,
        838,  112,   50,  670,    2,    9,   35,  480,  284,    5,  150,
          4,  172,  112,  167,    2,  336,  385,   39,    4,  172, 4536,
       1111,   17,  546,   38,   13,  447,    4,  192,   50,   16,    6,
        147, 2025,   19,   14,   22,    4, 1920, 4613,  469,    4,   22,
         71,   87,   12,   16,   43,  530,   38,   76,   15,   13, 1247,
          4,   22,   17,  515,   17,   12,   16,  626,   18,    2,    5,
         62,  386,   12,    8,  316,    8,  106,    5,    4, 2223, 5244,
         16,  480,   66, 3785,   33,    4,  130,   12,   16,   38,  619,
          5,   25,  124,   51,   36,  135,   48,   25, 1415,   33,    6,
         22,   12,  215,   28,   77,   52,    5,   14,  407,   16,   82,
          2,    8,    4,  107,  117, 5952,   15,  256,    4,    2,    7,
       3766,    5,  723,   36,   71,   43,  530,  476,   26,  400,  317,
         46,    7,    4,    2, 1029,   13,  104,   88,    4,  381,   15,
        297,   98,   32, 2071,   56,   26,  141,    6,  194, 7486,   18,
          4,  226,   22,   21,  134,  476,   26,  480,    5,  144,   30,
       5535,   18,   51,   36,   28,  224,   92,   25,  104,    4,  226,
         65,   16,   38, 1334,   88,   12,   16,  283,    5,   16, 4472,
        113,  103,   32,   15,   16, 5345,   19,  178,   32,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0], dtype=int32)

5. Design, train, tune and test a sequential model.

Hint: The aim here Is to import the text, process it such a way that it can be taken as an inout to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.

ANN

In [ ]:
# Input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
embedding_dim = 16
max_length = 256

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 256, 16)           160000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
In [ ]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test))
Epoch 1/10
196/196 [==============================] - 37s 162ms/step - loss: 0.6659 - accuracy: 0.6778 - val_loss: 0.6032 - val_accuracy: 0.7760
Epoch 2/10
196/196 [==============================] - 15s 77ms/step - loss: 0.4866 - accuracy: 0.8292 - val_loss: 0.4177 - val_accuracy: 0.8423
Epoch 3/10
196/196 [==============================] - 10s 51ms/step - loss: 0.3396 - accuracy: 0.8758 - val_loss: 0.3461 - val_accuracy: 0.8603
Epoch 4/10
196/196 [==============================] - 7s 34ms/step - loss: 0.2774 - accuracy: 0.8951 - val_loss: 0.3198 - val_accuracy: 0.8689
Epoch 5/10
196/196 [==============================] - 7s 34ms/step - loss: 0.2417 - accuracy: 0.9088 - val_loss: 0.3118 - val_accuracy: 0.8717
Epoch 6/10
196/196 [==============================] - 4s 21ms/step - loss: 0.2173 - accuracy: 0.9192 - val_loss: 0.3045 - val_accuracy: 0.8767
Epoch 7/10
196/196 [==============================] - 4s 21ms/step - loss: 0.1969 - accuracy: 0.9282 - val_loss: 0.3086 - val_accuracy: 0.8749
Epoch 8/10
196/196 [==============================] - 4s 21ms/step - loss: 0.1806 - accuracy: 0.9348 - val_loss: 0.3130 - val_accuracy: 0.8730
Epoch 9/10
196/196 [==============================] - 4s 19ms/step - loss: 0.1669 - accuracy: 0.9409 - val_loss: 0.3265 - val_accuracy: 0.8693
Epoch 10/10
196/196 [==============================] - 3s 14ms/step - loss: 0.1545 - accuracy: 0.9468 - val_loss: 0.3296 - val_accuracy: 0.8708
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
782/782 [==============================] - 1s 1ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
782/782 [==============================] - 2s 2ms/step - loss: 0.1397 - accuracy: 0.9546
Loss and Accuracy on Training data: [0.13974815607070923, 0.9546399712562561]
782/782 [==============================] - 2s 2ms/step - loss: 0.3296 - accuracy: 0.8708
Loss and Accuracy on Test data: [0.32960832118988037, 0.8707600235939026]

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.87      0.87     12500
           1       0.87      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

base_1 = []
base_1.append(['ANN', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
782/782 [==============================] - 2s 3ms/step - loss: 0.1397 - accuracy: 0.9546
782/782 [==============================] - 2s 2ms/step - loss: 0.3296 - accuracy: 0.8708
In [ ]:
# An approach for predicted and actual labels
for i in range(5):
  print(decode_review(X_test[i]))
  pred = model.predict(X_test[i].reshape(1, 256))
  print('Prediction prob = ', pred, '\t Actual =', y_test[i])
<START> please give this one a miss br br <UNK> <UNK> and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite <UNK> so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
1/1 [==============================] - 0s 20ms/step
Prediction prob =  [[0.08360814]] 	 Actual = 0
<START> this film requires a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances <UNK> the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere <UNK> with sexual tension and psychological <UNK> it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job as the mute young man regular altman player michael murphy has a small part the <UNK> moody set fits the content of the story very well in short this movie is a powerful study of loneliness sexual <UNK> and desperation be patient <UNK> up the atmosphere and pay attention to the wonderfully written script br br i praise robert altman this is one of his many films that deals with unconventional fascinating subject matter this film is disturbing but it's sincere and it's sure to <UNK> a strong emotional response from the viewer if you want to see an unusual film some might even say bizarre this is worth the time br br unfortunately it's very difficult to find in video stores you may have to buy
1/1 [==============================] - 0s 19ms/step
Prediction prob =  [[0.99994504]] 	 Actual = 1
<START> many animation buffs consider <UNK> <UNK> the great forgotten genius of one special branch of the art puppet animation which he invented almost single <UNK> and as it happened almost accidentally as a young man <UNK> was more interested in <UNK> than the cinema but his <UNK> attempt to film two <UNK> <UNK> fighting led to an unexpected breakthrough in film making when he realized he could <UNK> movement by <UNK> beetle <UNK> and <UNK> them one frame at a time this discovery led to the production of amazingly elaborate classic short the <UNK> revenge which he made in russia in <UNK> at a time when motion picture animation of all sorts was in its <UNK> br br the political <UNK> of the russian revolution caused <UNK> to move to paris where one of his first productions <UNK> was a dark political satire <UNK> known as <UNK> or the <UNK> who wanted a king a strain of black comedy can be found in almost all of films but here it is very dark indeed aimed more at grown ups who can appreciate the satirical aspects than children who would most likely find the climax <UNK> i'm middle aged and found it pretty <UNK> myself and indeed <UNK> of the film intended for english speaking viewers of the 1920s were given title cards filled with <UNK> and <UNK> in order to help <UNK> the sharp <UNK> of the finale br br our tale is set in a swamp the <UNK> <UNK> where the citizens are unhappy
1/1 [==============================] - 0s 17ms/step
Prediction prob =  [[0.9172134]] 	 Actual = 1
<START> i generally love this type of movie however this time i found myself wanting to kick the screen since i can't do that i will just complain about it this was absolutely idiotic the things that happen with the dead kids are very cool but the alive people are absolute idiots i am a grown man pretty big and i can defend myself well however i would not do half the stuff the little girl does in this movie also the mother in this movie is reckless with her children to the point of neglect i wish i wasn't so angry about her and her actions because i would have otherwise enjoyed the flick what a number she was take my advise and fast forward through everything you see her do until the end also is anyone else getting sick of watching movies that are filmed so dark anymore one can hardly see what is being filmed as an audience we are <UNK> involved with the actions on the screen so then why the hell can't we have night vision <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
1/1 [==============================] - 0s 19ms/step
Prediction prob =  [[0.5623875]] 	 Actual = 0
<START> like some other people wrote i'm a die hard mario fan and i loved this game br br this game starts slightly boring but trust me it's worth it as soon as you start your hooked the levels are fun and <UNK> they will hook you <UNK> your mind turns to <UNK> i'm not kidding this game is also <UNK> and is beautifully done br br to keep this spoiler free i have to keep my mouth shut about details but please try this game it'll be worth it br br story 9 9 action 10 1 it's that good <UNK> 10 attention <UNK> 10 average 10 <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
1/1 [==============================] - 0s 17ms/step
Prediction prob =  [[0.9858036]] 	 Actual = 1

RNN

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, AveragePooling1D, Bidirectional, LSTM, SimpleRNN, Dense

vocab_size = 10000
embedding_dim = 16
max_length = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length = max_length))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
# model.add(Bidirectional(SimpleRNN(32, dropout = 0.5)))
model.add((SimpleRNN(32, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
196/196 [==============================] - 58s 263ms/step - loss: 0.6948 - accuracy: 0.4960 - val_loss: 0.6933 - val_accuracy: 0.5015
Epoch 2/10
196/196 [==============================] - 46s 234ms/step - loss: 0.6902 - accuracy: 0.5303 - val_loss: 0.6936 - val_accuracy: 0.5032
Epoch 3/10
196/196 [==============================] - 45s 228ms/step - loss: 0.6801 - accuracy: 0.5534 - val_loss: 0.6950 - val_accuracy: 0.5063
Epoch 4/10
196/196 [==============================] - 40s 202ms/step - loss: 0.6437 - accuracy: 0.5886 - val_loss: 0.7211 - val_accuracy: 0.4956
Epoch 5/10
196/196 [==============================] - 40s 206ms/step - loss: 0.5776 - accuracy: 0.6360 - val_loss: 0.7716 - val_accuracy: 0.5038
Epoch 6/10
196/196 [==============================] - 50s 255ms/step - loss: 0.5253 - accuracy: 0.6641 - val_loss: 0.8398 - val_accuracy: 0.5025
Epoch 7/10
196/196 [==============================] - 37s 187ms/step - loss: 0.4871 - accuracy: 0.6875 - val_loss: 0.9328 - val_accuracy: 0.5018
Epoch 8/10
196/196 [==============================] - 39s 199ms/step - loss: 0.4655 - accuracy: 0.7023 - val_loss: 0.9334 - val_accuracy: 0.4962
Epoch 9/10
196/196 [==============================] - 38s 194ms/step - loss: 0.4452 - accuracy: 0.7184 - val_loss: 0.9710 - val_accuracy: 0.4993
Epoch 10/10
196/196 [==============================] - 39s 197ms/step - loss: 0.4318 - accuracy: 0.7268 - val_loss: 1.0690 - val_accuracy: 0.5021
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
782/782 [==============================] - 11s 14ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
782/782 [==============================] - 10s 12ms/step - loss: 0.3803 - accuracy: 0.7723
Loss and Accuracy on Training data: [0.3802814185619354, 0.7723199725151062]
782/782 [==============================] - 11s 14ms/step - loss: 1.0690 - accuracy: 0.5021
Loss and Accuracy on Test data: [1.069017767906189, 0.5020800232887268]

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.48      0.49     12500
           1       0.50      0.53      0.51     12500

    accuracy                           0.50     25000
   macro avg       0.50      0.50      0.50     25000
weighted avg       0.50      0.50      0.50     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['RNN', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
782/782 [==============================] - 11s 14ms/step - loss: 0.3803 - accuracy: 0.7723
782/782 [==============================] - 10s 12ms/step - loss: 1.0690 - accuracy: 0.5021

GRU

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, AveragePooling1D, Bidirectional, LSTM, SimpleRNN, GRU, Dense

vocab_size = 10000
embedding_dim = 16
max_length = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length = max_length))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
# model.add(Bidirectional(GRU(32, dropout = 0.5)))
model.add((GRU(32, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
196/196 [==============================] - 25s 110ms/step - loss: 0.6931 - accuracy: 0.4998 - val_loss: 0.6926 - val_accuracy: 0.5123
Epoch 2/10
196/196 [==============================] - 13s 65ms/step - loss: 0.5302 - accuracy: 0.7149 - val_loss: 0.3926 - val_accuracy: 0.8400
Epoch 3/10
196/196 [==============================] - 10s 51ms/step - loss: 0.2852 - accuracy: 0.8911 - val_loss: 0.3305 - val_accuracy: 0.8644
Epoch 4/10
196/196 [==============================] - 8s 40ms/step - loss: 0.2116 - accuracy: 0.9240 - val_loss: 0.3274 - val_accuracy: 0.8690
Epoch 5/10
196/196 [==============================] - 6s 30ms/step - loss: 0.1682 - accuracy: 0.9420 - val_loss: 0.3620 - val_accuracy: 0.8621
Epoch 6/10
196/196 [==============================] - 5s 26ms/step - loss: 0.1424 - accuracy: 0.9539 - val_loss: 0.4128 - val_accuracy: 0.8568
Epoch 7/10
196/196 [==============================] - 6s 32ms/step - loss: 0.1172 - accuracy: 0.9619 - val_loss: 0.4438 - val_accuracy: 0.8379
Epoch 8/10
196/196 [==============================] - 4s 19ms/step - loss: 0.0953 - accuracy: 0.9708 - val_loss: 0.4457 - val_accuracy: 0.8510
Epoch 9/10
196/196 [==============================] - 4s 20ms/step - loss: 0.0834 - accuracy: 0.9754 - val_loss: 0.4381 - val_accuracy: 0.8462
Epoch 10/10
196/196 [==============================] - 5s 26ms/step - loss: 0.0713 - accuracy: 0.9802 - val_loss: 0.5104 - val_accuracy: 0.8458
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
782/782 [==============================] - 3s 3ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
782/782 [==============================] - 3s 4ms/step - loss: 0.0559 - accuracy: 0.9868
Loss and Accuracy on Training data: [0.05585106462240219, 0.986840009689331]
782/782 [==============================] - 3s 4ms/step - loss: 0.5104 - accuracy: 0.8458
Loss and Accuracy on Test data: [0.5104315876960754, 0.8458399772644043]

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.82      0.84     12500
           1       0.83      0.87      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['GRU', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
782/782 [==============================] - 4s 5ms/step - loss: 0.0559 - accuracy: 0.9868
782/782 [==============================] - 3s 4ms/step - loss: 0.5104 - accuracy: 0.8458

LSTM

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, AveragePooling1D, Bidirectional, LSTM, Dense

vocab_size = 10000
embedding_dim = 16
max_length = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length = max_length))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
# model.add(Bidirectional(LSTM(32, dropout = 0.5)))
model.add((LSTM(32, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
196/196 [==============================] - 23s 106ms/step - loss: 0.6890 - accuracy: 0.5211 - val_loss: 0.6461 - val_accuracy: 0.6796
Epoch 2/10
196/196 [==============================] - 14s 72ms/step - loss: 0.6532 - accuracy: 0.6355 - val_loss: 0.6937 - val_accuracy: 0.5099
Epoch 3/10
196/196 [==============================] - 10s 50ms/step - loss: 0.6877 - accuracy: 0.5196 - val_loss: 0.6901 - val_accuracy: 0.5106
Epoch 4/10
196/196 [==============================] - 8s 40ms/step - loss: 0.6744 - accuracy: 0.5472 - val_loss: 0.6212 - val_accuracy: 0.6864
Epoch 5/10
196/196 [==============================] - 6s 30ms/step - loss: 0.6710 - accuracy: 0.5767 - val_loss: 0.6817 - val_accuracy: 0.5437
Epoch 6/10
196/196 [==============================] - 5s 27ms/step - loss: 0.6416 - accuracy: 0.6128 - val_loss: 0.5777 - val_accuracy: 0.7240
Epoch 7/10
196/196 [==============================] - 4s 23ms/step - loss: 0.5502 - accuracy: 0.7409 - val_loss: 0.6071 - val_accuracy: 0.6805
Epoch 8/10
196/196 [==============================] - 5s 26ms/step - loss: 0.4877 - accuracy: 0.8031 - val_loss: 0.5491 - val_accuracy: 0.7701
Epoch 9/10
196/196 [==============================] - 4s 20ms/step - loss: 0.4833 - accuracy: 0.7963 - val_loss: 0.5417 - val_accuracy: 0.7584
Epoch 10/10
196/196 [==============================] - 4s 22ms/step - loss: 0.6782 - accuracy: 0.5744 - val_loss: 0.6883 - val_accuracy: 0.5086
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
782/782 [==============================] - 3s 4ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
782/782 [==============================] - 4s 5ms/step - loss: 0.6718 - accuracy: 0.5417
Loss and Accuracy on Training data: [0.6718079447746277, 0.5416799783706665]
782/782 [==============================] - 4s 5ms/step - loss: 0.6883 - accuracy: 0.5086
Loss and Accuracy on Test data: [0.6883372068405151, 0.5085600018501282]

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.74      0.60     12500
           1       0.52      0.28      0.36     12500

    accuracy                           0.51     25000
   macro avg       0.51      0.51      0.48     25000
weighted avg       0.51      0.51      0.48     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['LSTM', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
782/782 [==============================] - 4s 5ms/step - loss: 0.6718 - accuracy: 0.5417
782/782 [==============================] - 4s 5ms/step - loss: 0.6883 - accuracy: 0.5086

Logistic Regression

In [ ]:
# Build the model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 0.5434
Accuracy on Test data: 0.50824

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.40      0.45     12500
           1       0.51      0.61      0.55     12500

    accuracy                           0.51     25000
   macro avg       0.51      0.51      0.50     25000
weighted avg       0.51      0.51      0.50     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['Logistic Regression', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

KNN

In [ ]:
# Build the model
model = KNeighborsClassifier()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 0.6864
Accuracy on Test data: 0.50252

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.54      0.52     12500
           1       0.50      0.47      0.48     12500

    accuracy                           0.50     25000
   macro avg       0.50      0.50      0.50     25000
weighted avg       0.50      0.50      0.50     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['K Neighbors', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

SVM

In [ ]:
# Build the model
model = SVC()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 0.82956
Accuracy on Test data: 0.52644

Classification Report:
               precision    recall  f1-score   support

           0       0.52      0.58      0.55     12500
           1       0.53      0.47      0.50     12500

    accuracy                           0.53     25000
   macro avg       0.53      0.53      0.53     25000
weighted avg       0.53      0.53      0.53     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['SVM', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

Multinomial NB

In [ ]:
# Build the model
model = MultinomialNB()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 0.54076
Accuracy on Test data: 0.5078

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.56      0.53     12500
           1       0.51      0.46      0.48     12500

    accuracy                           0.51     25000
   macro avg       0.51      0.51      0.51     25000
weighted avg       0.51      0.51      0.51     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['Multinomial NB', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

Decision Tree

In [ ]:
# Build the model
model = DecisionTreeClassifier()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 1.0
Accuracy on Test data: 0.50896

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.51      0.51     12500
           1       0.51      0.51      0.51     12500

    accuracy                           0.51     25000
   macro avg       0.51      0.51      0.51     25000
weighted avg       0.51      0.51      0.51     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['Decision Tree', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

Random Forest

In [ ]:
# Build the model
model = RandomForestClassifier()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 1.0
Accuracy on Test data: 0.53252

Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.57      0.55     12500
           1       0.54      0.49      0.51     12500

    accuracy                           0.53     25000
   macro avg       0.53      0.53      0.53     25000
weighted avg       0.53      0.53      0.53     25000

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['Random Forest', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)

Ada Boost

In [ ]:
# Build the model
model = AdaBoostClassifier()
# Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification Accuracy
print("Classification Accuracy:")
print('Accuracy on Training data:',model.score(X_train, y_train))
print('Accuracy on Test data:',model.score(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
Accuracy on Training data: 0.581
Accuracy on Test data: 0.53596

Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.57      0.55     12500
           1       0.54      0.50      0.52     12500

    accuracy                           0.54     25000
   macro avg       0.54      0.54      0.54     25000
weighted avg       0.54      0.54      0.54     25000

Confusion Matrix Chart:

Model Comparison

In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.score(X_train, y_train)
Test_Accuracy = model.score(X_test, y_test)

# base_1 = []
base_1.append(['Ada Boost', Train_Accuracy, Test_Accuracy, precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
model_comparison
Out[ ]:
Model Train Accuracy Test Accuracy Precision Recall F1 Score
0 ANN 0.95464 0.87076 0.870786 0.87076 0.870758
2 GRU 0.98684 0.84584 0.846772 0.84584 0.845736
10 Ada Boost 0.58100 0.53596 0.536109 0.53596 0.535481
9 Random Forest 1.00000 0.53252 0.532747 0.53252 0.531708
6 SVM 0.82956 0.52644 0.526726 0.52644 0.525170
8 Decision Tree 1.00000 0.50896 0.508961 0.50896 0.508953
3 LSTM 0.54168 0.50856 0.510876 0.50856 0.480929
4 Logistic Regression 0.54340 0.50824 0.508616 0.50824 0.502821
7 Multinomial NB 0.54076 0.50780 0.507876 0.50780 0.506615
5 K Neighbors 0.68640 0.50252 0.502532 0.50252 0.501910
1 RNN 0.77232 0.50208 0.502086 0.50208 0.501743

BERT

In [ ]:
# Install Transformers library
!pip install transformers
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 70.9 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.21.6)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (2022.6.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (23.0)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 27.6 MB/s eta 0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers) (3.9.0)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers) (2.25.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers) (4.64.1)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 107.0 MB/s eta 0:00:00
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.4.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2022.12.7)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.1
In [ ]:
# Load the BERT Classifier and Tokenizer alıng with Input modules
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [ ]:
# We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for classification task:
model.summary()
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
In [ ]:
# We have two pandas Dataframe objects waiting for us to convert them into suitable objects for the BERT model. 
# We will take advantage of the InputExample function that helps us to create sequences from our dataset. 
# The InputExample function can be called as follows:
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)
Out[ ]:
InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

Now we will create two main functions:

  1. convert_data_to_examples: This will accept our train and test datasets and convert each row into an InputExample object.
  2. convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.
In [ ]:
def convert_data_to_examples(train, test, processed, label): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[processed], 
                                                          text_b = None,
                                                          label = x[label]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[processed], 
                                                          text_b = None,
                                                          label = x[label]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

  train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                           test, 
                                                                           'processed', 
                                                                           'label')
  
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True, # pads to the right by default # CHECK THIS for pad_to_max_length
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


processed = 'processed'
label = 'label'
In [ ]:
# Our dataset containing processed input sequences are ready to be fed to the model.
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, processed, label)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)
In [ ]:
# We will use Adam as our optimizer, CategoricalCrossentropy as our loss function, and SparseCategoricalAccuracy as our accuracy metric. 
# Fine-tuning the model for 2 epochs will give us around 90% accuracy, which is great.

# Training the model might take a while, so ensure you enabled the GPU acceleration from the Notebook Settings.

%%time

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

H = model.fit(train_data, epochs=2, validation_data=validation_data)

# 50 min for maxlen = 128
Epoch 1/2
1564/1564 [==============================] - 1544s 952ms/step - loss: 0.2533 - accuracy: 0.8938 - val_loss: 0.3117 - val_accuracy: 0.8928
Epoch 2/2
1564/1564 [==============================] - 1523s 974ms/step - loss: 0.0705 - accuracy: 0.9759 - val_loss: 0.4324 - val_accuracy: 0.8890
CPU times: user 25min 45s, sys: 9min 23s, total: 35min 9s
Wall time: 52min 6s

6. Use the designed model to print the prediction on any one sample.

In [ ]:
# Making Predictions
# I created a list of two reviews I created. The first one is a positive review, while the second one is clearly negative.
pred_sentences = ['This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
                  'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie']
In [ ]:
# We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model
# and run a final softmax layer to get the predictions. We can then use the argmax function to determine whether our sentiment 
# prediction for the review is positive or negative. Finally, we will print out the results with a simple for loop. 
# The following lines do all of these said operations:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])
This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good : 
 Positive
One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie : 
 Negative
In [ ]:
# Using the BERT on 5 test samples
predict_set = test[0:5]
pred_sentences = list(predict_set['processed'])

tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])
john cassavetes is on the run from the law he is at the bottom of the heap he sees negro sidney poitier as his equal and they quickly become friends forming a sort of alliance against a bully of a foreman played by jack wardenas someone who has worked in a warehouse myself when i was younger i can tell you that the warehouse fights complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets ive been in fights like these myself although no one got killedthe introduction of sidney poitiers widow is a variation on shakespeares shylock do i not bleed this is an anti racist film which at the time was much neededall the three principle characters  warden cassavetes and poitier  are superb with warden the most outstanding of the three : 
 Positive
its not just that the movie is lame its more than that this movie is just unnecessary do we need another western how about a western with afroamericans in the titles roles sound stupid implausible and a lame attempt at modernizing the genre it is incredibly lame and simple minded its like that lame baz luhrman film romeo and juliet where he set it in modern times to attract young folks and create some hype with his revamping of a classic tale well baz luhrman failed miserably and so does this mess the story is actually not bad however the whole idea of removing the racism out of a racist genre by casting an all afroamerican cast is racist in itself its also puerile and simple minded like baz luhrmanman hes a bad director hey i hear you say this was directed by mario van peebles hes also in the film how can it be racist its not i said the idea of casting all afroamericans instead of caucasians was the film isnt racist its just pointless stupid and very very boring : 
 Negative
well if it werent for ethel waters and a 7yearold sammy davis jr here billed without the jr rufus jones for president would be one of the worst representations of africanamerican stereotypes ive seen from the early talkie era and wouldnt have been worth seeing because of that ms waters is excellent here singing am i blue and underneath our harlem moon while mr davis shows us how his childhood experience in showbiz prepared him for his superstar status as an adult hes so good tapdancing here that for awhile i thought he was a little person with decades of experience so if youre willing to ignore the negative connotations here rufus jones for president should provide some good enjoyment ps this marks the fourth time today ive seen and heard the song ill be glad when youre dead you rascal you performed on film this time by davis must have been a very popular song about this time : 
 Negative
i find alan jacobs review very accurate concerning the moviehowever i had the opportunity to rent the dvd from blockbuster with a commentary from byus curator motion picture archives james darc the then lds prophet heber j grant approved of the movie understanding the deviations from historic content for dramatic expression and telescoping events for example the movie showed joseph smith on trial despite brigham youngs great oratory in defense of joseph smith he was convicted anyway then joseph was killed historically joseph smith was never convicted of anything brigham young was in boston when joseph smith was arrested for this particular trial joseph smith and his brother hyrum where both killed before the trial took place : 
 Positive
this movie is simply awesome it is so hilarious although the skating and other montages are played out the comedy is awesome raab himself and brandon dicamillo are hilarious there will be moments when you cant breath youre laughing so hard plus there are scenes that you can watch hundreds of times and still laugh this is one of the funniest comedies ive ever seen : 
 Positive

Conclusion

In this project, we have learned how to clean and prepare the text data to feed into various ML/DL Models.

We have compared the performance of various ML/DL models with precision, recall, F1 and Accuracies (Train and Test).

There are several ideas that we can try to improve the model performance:

  • We can change dimension of the embedding layer
  • Hyperparameter tuning of various models
  • Different vectorization mehtods can also be tested
  • Text cleaning can further improve the model performance
  • More advanced transformers can also be tried in this project

Part-B: Solution

  • DOMAIN: Social media analytics
  • CONTEXT: Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag based supervision but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets and detecting sarcasm in these requires the availability of contextual tweets.In this hands-on project, the goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.
  • DATA DESCRIPTION: The dataset is collected from two news websites, theonion.com and huffingtonpost.com. This new dataset has the following advantages over the existing Twitter datasets:

    • Since news headlines are written by professionals in a formal manner, there are no spelling mistakes and informal usage. This reduces the sparsity and also increases the chance of finding pre-trained embeddings.
    • Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise as compared to Twitter datasets.
    • Unlike tweets that reply to other tweets, the news headlines obtained are self-contained. This would help us in teasing apart the real sarcastic elements
  • Content: Each record consists of three attributes:

  • PROJECT OBJECTIVE: Build a sequential NLP classifier which can use input text parameters to determine the customer sentiments.

In [ ]:
# Import all the relevant libraries needed to complete the analysis, visualization, modeling and presentation
import pandas as pd
import numpy as np
import os

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')

from scipy import stats
from scipy.stats import zscore

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 
from sklearn.metrics import ConfusionMatrixDisplay, precision_score, recall_score 
from sklearn.metrics import precision_recall_curve, roc_curve, auc, roc_auc_score
from sklearn.metrics import plot_precision_recall_curve, average_precision_score
from sklearn.metrics import f1_score, plot_roc_curve 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC

# from sklearn.decomposition import PCA
# from scipy.cluster.hierarchy import dendrogram, linkage
# from scipy.cluster.hierarchy import fcluster
# from sklearn.cluster import KMeans 
# from sklearn.metrics import silhouette_samples, silhouette_score

# import xgboost as xgb
# from xgboost import plot_importance
# from lightgbm import LGBMClassifier

# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import AdaBoostClassifier
# from sklearn.ensemble import BaggingClassifier
# from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.ensemble import VotingClassifier

# from imblearn.over_sampling import RandomOverSampler
# from imblearn.over_sampling import SMOTENC, SMOTE, ADASYN
# from imblearn.under_sampling import RandomUnderSampler

import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer
import pandas_profiling as pp

import gensim
import logging

# import cv2
# from google.colab.patches import cv2_imshow
# from glob import glob
# import itertools

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.layers import Activation, GlobalMaxPool2D, GlobalAveragePooling2D
from tensorflow.keras.layers import UpSampling2D, Input, Concatenate
from tensorflow.keras.layers import BatchNormalization, LeakyReLU
from tensorflow.keras.optimizers import Adam, RMSprop, SGD, Adagrad

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.metrics import Recall, Precision
from tensorflow.keras import backend as K

from tensorflow import keras
from keras.utils.np_utils import to_categorical  
from keras.utils import np_utils
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor

import warnings
warnings.filterwarnings("ignore")

import random
from zipfile import ZipFile

# Set random_state
random_state = 42

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

1. Read and explore the data

In [ ]:
# Current working directory
%cd "/content/drive/MyDrive/MGL/Project-NLP-2/"

# # List all the files in a directory
# for dirname, _, filenames in os.walk('path'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))
/content/drive/MyDrive/MGL/Project-NLP-2
In [ ]:
# List files in the directory
!ls
 1.ipynb		 model.png
 2.ipynb		'NLP-2_Problem Statement.pdf'
 glove.6B.50d.txt	'NLP FAQ Sequential-1.pdf'
'IMDB Dataset.csv'	'REVALUATION POLICY-7.pdf'
'IMDB Dataset.csv.zip'	 Sarcasm_Headlines_Dataset_v2.json
'Milestone-NLP 2.pdf'	 Sarcasm_Headlines_Dataset_v2.json.zip
In [ ]:
# # Path of the data file
# path = 'Sarcasm_Headlines_Dataset_v2.json.zip'

# # Unzip files in the current directory

# with ZipFile (path,'r') as z:
#   z.extractall() 
# print("Training zip extraction done!")
In [ ]:
# Import the dataset
# Creat dataframe from the json file
df = pd.read_json('Sarcasm_Headlines_Dataset_v2.json', lines=True)
In [ ]:
df.shape
Out[ ]:
(28619, 3)
In [ ]:
pd.set_option('display.max_colwidth', None)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28619 entries, 0 to 28618
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   is_sarcastic  28619 non-null  int64 
 1   headline      28619 non-null  object
 2   article_link  28619 non-null  object
dtypes: int64(1), object(2)
memory usage: 670.9+ KB
Out[ ]:
is_sarcastic headline article_link
0 1 thirtysomething scientists unveil doomsday clock of hair loss https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205
1 0 dem rep. totally nails why congress is falling short on gender, racial equality https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207
2 0 eat your veggies: 9 deliciously different recipes https://www.huffingtonpost.com/entry/eat-your-veggies-9-delici_b_8899742.html
3 1 inclement weather prevents liar from getting to work https://local.theonion.com/inclement-weather-prevents-liar-from-getting-to-work-1819576031
4 1 mother comes pretty close to using word 'streaming' correctly https://www.theonion.com/mother-comes-pretty-close-to-using-word-streaming-cor-1819575546
In [ ]:
# As the dataset is large; use a subset of the data. Let's Check what is working on the local machine.
# Can use 10,000/100,000 later
# df = pd.read_csv("blogtext.csv", nrows=1000) 
# df = df.sample(n=10000, random_state = 0)

# df.info()
In [ ]:
# Check for unique values: 1 = Sarcastic, 0 = Not Sarcastic
df.is_sarcastic.value_counts()
Out[ ]:
0    14985
1    13634
Name: is_sarcastic, dtype: int64
In [ ]:
# Check for NaN values
df.isna().sum() 
Out[ ]:
is_sarcastic    0
headline        0
article_link    0
dtype: int64
In [ ]:
# Describe function generates descriptive statistics that summarize the central tendency, 
# dispersion and shape of a dataset’s distribution, excluding NaN values.

# This method tells us a lot of things about a dataset. One important thing is that 
# the describe() method deals only with numeric values. It doesn't work with any 
# categorical values. So if there are any categorical values in a column the describe() 
# method will ignore it and display summary for the other columns.

df.describe(include='all').transpose()
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
is_sarcastic 28619.0 NaN NaN NaN 0.476397 0.499451 0.0 0.0 0.0 1.0 1.0
headline 28619 28503 'no way to prevent this,' says only nation where this regularly happens 12 NaN NaN NaN NaN NaN NaN NaN
article_link 28619 28617 https://politics.theonion.com/nation-not-sure-how-many-ex-trump-staffers-it-can-safel-1823468346 2 NaN NaN NaN NaN NaN NaN NaN
In [ ]:
# Clear the matplotlib plotting backend
%matplotlib inline
plt.close('all')
In [ ]:
# Understand the 'sentiment' the target vector
f,axes=plt.subplots(1,2,figsize=(17,7))
df['is_sarcastic'].value_counts().plot.pie(autopct='%1.1f%%',ax=axes[0])
sns.countplot('is_sarcastic',data=df,ax=axes[1])
axes[0].set_title('Pie Chart for sarcasm')
axes[1].set_title('Bar Graph for sarcasm')
plt.show()

So, We can see that the dataset is balanced. Its good for a classification task.

2. Retain relevant columns

In [ ]:
df = df[['headline', 'is_sarcastic']]
df.head()
Out[ ]:
headline is_sarcastic
0 thirtysomething scientists unveil doomsday clock of hair loss 1
1 dem rep. totally nails why congress is falling short on gender, racial equality 0
2 eat your veggies: 9 deliciously different recipes 0
3 inclement weather prevents liar from getting to work 1
4 mother comes pretty close to using word 'streaming' correctly 1

3. Get length of each sentence

In [ ]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import plotly.express as px
from plotly.offline import init_notebook_mode
import re
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm
from nltk.stem import WordNetLemmatizer
import spacy

tqdm.pandas()
spacy_eng = spacy.load("en_core_web_sm")
nltk.download('stopwords')
lemm = WordNetLemmatizer()
init_notebook_mode(connected=True)
sns.set_style("darkgrid")
plt.rcParams['figure.figsize'] = (20,8)
plt.rcParams['font.size'] = 18
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
In [ ]:
nltk.download('all')
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package bllip_wsj_no_aux to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/book_grammars.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package city_database to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/city_database.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package comparative_sentences to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/comparative_sentences.zip.
[nltk_data]    | Downloading package comtrans to /root/nltk_data...
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package conll2007 to /root/nltk_data...
[nltk_data]    | Downloading package crubadan to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/crubadan.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package dolch to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dolch.zip.
[nltk_data]    | Downloading package europarl_raw to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/europarl_raw.zip.
[nltk_data]    | Downloading package extended_omw to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package floresta to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/floresta.zip.
[nltk_data]    | Downloading package framenet_v15 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/framenet_v15.zip.
[nltk_data]    | Downloading package framenet_v17 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/framenet_v17.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package ieer to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ieer.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package indian to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/indian.zip.
[nltk_data]    | Downloading package jeita to /root/nltk_data...
[nltk_data]    | Downloading package kimmo to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/kimmo.zip.
[nltk_data]    | Downloading package knbc to /root/nltk_data...
[nltk_data]    | Downloading package large_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/large_grammars.zip.
[nltk_data]    | Downloading package lin_thesaurus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/lin_thesaurus.zip.
[nltk_data]    | Downloading package mac_morpho to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/mac_morpho.zip.
[nltk_data]    | Downloading package machado to /root/nltk_data...
[nltk_data]    | Downloading package masc_tagged to /root/nltk_data...
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data]    | Downloading package moses_sample to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping models/moses_sample.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package mte_teip5 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/mte_teip5.zip.
[nltk_data]    | Downloading package mwa_ppdb to /root/nltk_data...
[nltk_data]    |   Unzipping misc/mwa_ppdb.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package nombank.1.0 to /root/nltk_data...
[nltk_data]    | Downloading package nonbreaking_prefixes to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/nonbreaking_prefixes.zip.
[nltk_data]    | Downloading package nps_chat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/nps_chat.zip.
[nltk_data]    | Downloading package omw to /root/nltk_data...
[nltk_data]    | Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/opinion_lexicon.zip.
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package paradigms to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/paradigms.zip.
[nltk_data]    | Downloading package pe08 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/pe08.zip.
[nltk_data]    | Downloading package perluniprops to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping misc/perluniprops.zip.
[nltk_data]    | Downloading package pil to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/pil.zip.
[nltk_data]    | Downloading package pl196x to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/pl196x.zip.
[nltk_data]    | Downloading package porter_test to /root/nltk_data...
[nltk_data]    |   Unzipping stemmers/porter_test.zip.
[nltk_data]    | Downloading package ppattach to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ppattach.zip.
[nltk_data]    | Downloading package problem_reports to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/problem_reports.zip.
[nltk_data]    | Downloading package product_reviews_1 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/product_reviews_1.zip.
[nltk_data]    | Downloading package product_reviews_2 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/product_reviews_2.zip.
[nltk_data]    | Downloading package propbank to /root/nltk_data...
[nltk_data]    | Downloading package pros_cons to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/pros_cons.zip.
[nltk_data]    | Downloading package ptb to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ptb.zip.
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Unzipping tokenizers/punkt.zip.
[nltk_data]    | Downloading package qc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/qc.zip.
[nltk_data]    | Downloading package reuters to /root/nltk_data...
[nltk_data]    | Downloading package rslp to /root/nltk_data...
[nltk_data]    |   Unzipping stemmers/rslp.zip.
[nltk_data]    | Downloading package rte to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/rte.zip.
[nltk_data]    | Downloading package sample_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/sample_grammars.zip.
[nltk_data]    | Downloading package semcor to /root/nltk_data...
[nltk_data]    | Downloading package senseval to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/senseval.zip.
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/sentence_polarity.zip.
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/sentiwordnet.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/sinica_treebank.zip.
[nltk_data]    | Downloading package smultron to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/smultron.zip.
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package spanish_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/spanish_grammars.zip.
[nltk_data]    | Downloading package state_union to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/state_union.zip.
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package subjectivity to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/subjectivity.zip.
[nltk_data]    | Downloading package swadesh to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/swadesh.zip.
[nltk_data]    | Downloading package switchboard to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/switchboard.zip.
[nltk_data]    | Downloading package tagsets to /root/nltk_data...
[nltk_data]    |   Unzipping help/tagsets.zip.
[nltk_data]    | Downloading package timit to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/timit.zip.
[nltk_data]    | Downloading package toolbox to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/toolbox.zip.
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/treebank.zip.
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/twitter_samples.zip.
[nltk_data]    | Downloading package udhr to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/udhr.zip.
[nltk_data]    | Downloading package udhr2 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/udhr2.zip.
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/unicode_samples.zip.
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/universal_tagset.zip.
[nltk_data]    | Downloading package universal_treebanks_v20 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package vader_lexicon to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package verbnet to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/verbnet.zip.
[nltk_data]    | Downloading package verbnet3 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/verbnet3.zip.
[nltk_data]    | Downloading package webtext to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/webtext.zip.
[nltk_data]    | Downloading package wmt15_eval to /root/nltk_data...
[nltk_data]    |   Unzipping models/wmt15_eval.zip.
[nltk_data]    | Downloading package word2vec_sample to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping models/word2vec_sample.zip.
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    | Downloading package wordnet2021 to /root/nltk_data...
[nltk_data]    | Downloading package wordnet2022 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/wordnet2022.zip.
[nltk_data]    | Downloading package wordnet31 to /root/nltk_data...
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/wordnet_ic.zip.
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/words.zip.
[nltk_data]    | Downloading package ycoe to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ycoe.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection all
Out[ ]:
True
In [ ]:
# Text Cleaning:
# We will not remove numbers from the text data right away, lets further analyse if they contain any relevant information
# We can find the entity type of the tokens in the sentences using Named Entity Recognition (NER), this will help us identify
# the type and relevance of numbers in our text data

stop_words = stopwords.words('english')
stop_words.remove('not')

def text_cleaning(x):
    
    headline = re.sub('\s+\n+', ' ', x)
    headline = re.sub('[^a-zA-Z0-9]', ' ', x)
    headline = headline.lower()
    headline = headline.split()
    
    headline = [lemm.lemmatize(word, "v") for word in headline if not word in stop_words]
    headline = ' '.join(headline)
    
    return headline
In [ ]:
def get_entities(x):
    entity = []
    text = spacy_eng(x)
    for word in text.ents:
        entity.append(word.label_)
    return ",".join(entity)

df['entity'] = df['headline'].progress_apply(get_entities)
100%|██████████| 28619/28619 [03:57<00:00, 120.73it/s]
In [ ]:
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[ ]:
True
In [ ]:
# Dataset with entity, clean_headline and sentence_length
df['clean_headline'] = df['headline'].apply(text_cleaning)

df['sentence_length'] = df['clean_headline'].apply(lambda x: len(x.split()))
df
Out[ ]:
headline is_sarcastic entity clean_headline sentence_length
0 thirtysomething scientists unveil doomsday clock of hair loss 1 DATE thirtysomething scientists unveil doomsday clock hair loss 7
1 dem rep. totally nails why congress is falling short on gender, racial equality 0 ORG,ORG dem rep totally nail congress fall short gender racial equality 10
2 eat your veggies: 9 deliciously different recipes 0 CARDINAL eat veggies 9 deliciously different recipes 6
3 inclement weather prevents liar from getting to work 1 inclement weather prevent liar get work 6
4 mother comes pretty close to using word 'streaming' correctly 1 mother come pretty close use word stream correctly 8
... ... ... ... ... ...
28614 jews to celebrate rosh hashasha or something 1 NORP jews celebrate rosh hashasha something 5
28615 internal affairs investigator disappointed conspiracy doesn't go all the way to the top 1 internal affairs investigator disappoint conspiracy go way top 8
28616 the most beautiful acceptance speech this week came from a queer korean 0 DATE,NORP beautiful acceptance speech week come queer korean 7
28617 mars probe destroyed by orbiting spielberg-gates space palace 1 mar probe destroy orbit spielberg gate space palace 8
28618 dad clarifies this not a food stop 1 dad clarify not food stop 5

28619 rows × 5 columns

In [ ]:
# Headline length distribution
# Check for outliers in headline column
# Generally the headlines shouldn't be more than 20-40 words
# Box Plot

fig = px.histogram(df, x="sentence_length",height=700, color='is_sarcastic', title="Headlines Length Distribution", marginal="box")
fig.show(renderer="colab")
In [ ]:
df[df['sentence_length']==107]['headline']
Out[ ]:
7302    hot wheels ranked number one toy for rolling down ramp, knocking over dominoes that send marble down a funnel, dropping onto teeter-totter that yanks on string, causing pulley system to raise wooden block, propelling series of twine rollers that unwind spring, launching tennis ball across room, inching tire down slope until it hits power switch, activating table fan that blows toy ship with nail attached to it across kiddie pool, popping water balloon that fills cup, weighing down lever that forces basketball down track, nudging broomstick on axis to rotate, allowing golf ball to roll into sideways coffee mug, which tumbles down row of hardcover books until handle catches hook attached to lever that causes wooden mallet to slam down on serving spoon, catapulting small ball into cup attached by ribbon to lazy susan, which spins until it pushes d battery down incline plane, tipping over salt shaker to season omelet
Name: headline, dtype: object
In [ ]:
df.drop(df[df['sentence_length'] == 107].index, inplace = True)
df.reset_index(inplace=True, drop=True)
In [ ]:
# Headline length distribution: Outliers Removed
# The headlines after the removal of outliers do not exceed the limit of 20-40 words
# They are mostly centered in the range of 5-10 words
fig = px.histogram(df, x="sentence_length",height=700, color='is_sarcastic', title="Headlines Length Distribution", marginal="box")
fig.show(renderer="colab")
In [ ]:
# Filtering: Find Sentences that Contain Numbers
df['contains_number'] = df['clean_headline'].apply(lambda x: bool(re.search(r'\d+', x)))
df
Out[ ]:
headline is_sarcastic entity clean_headline sentence_length contains_number
0 thirtysomething scientists unveil doomsday clock of hair loss 1 DATE thirtysomething scientists unveil doomsday clock hair loss 7 False
1 dem rep. totally nails why congress is falling short on gender, racial equality 0 ORG,ORG dem rep totally nail congress fall short gender racial equality 10 False
2 eat your veggies: 9 deliciously different recipes 0 CARDINAL eat veggies 9 deliciously different recipes 6 True
3 inclement weather prevents liar from getting to work 1 inclement weather prevent liar get work 6 False
4 mother comes pretty close to using word 'streaming' correctly 1 mother come pretty close use word stream correctly 8 False
... ... ... ... ... ... ...
28613 jews to celebrate rosh hashasha or something 1 NORP jews celebrate rosh hashasha something 5 False
28614 internal affairs investigator disappointed conspiracy doesn't go all the way to the top 1 internal affairs investigator disappoint conspiracy go way top 8 False
28615 the most beautiful acceptance speech this week came from a queer korean 0 DATE,NORP beautiful acceptance speech week come queer korean 7 False
28616 mars probe destroyed by orbiting spielberg-gates space palace 1 mar probe destroy orbit spielberg gate space palace 8 False
28617 dad clarifies this not a food stop 1 dad clarify not food stop 5 False

28618 rows × 6 columns

Analysis of samples containing numbers of Time, Date or Cardinal Entity:

  • The numbers in a text data can have different implications
  • While the naive text preprocessing methods suggest that the numbers should be removed along with the special characters
  • The entity type of these numbers should be identified to get their exact implications
In [ ]:
# Date Entity: Randome Samples
df[(df['contains_number']) & (df['sentence_length']<=5) & (df['entity']=='DATE')].sample(10)
Out[ ]:
headline is_sarcastic entity clean_headline sentence_length contains_number
16369 if you thought 2016 was terrible, you're actually in the minority 0 DATE think 2016 terrible actually minority 5 True
25942 trump brags that he won most of the women's vote in 2016. he didn't. 0 DATE trump brag women vote 2016 5 True
3590 what's ahead for reputation in 2015 0 DATE ahead reputation 2015 3 True
26656 news roundup for august 17, 2017 0 DATE news roundup august 17 2017 5 True
15007 the 2016 'dumbing down' of america 0 DATE 2016 dumbing america 3 True
10120 the best teams in sports... 5 years from now 0 DATE best team sport 5 years 5 True
24788 once upon a festival 2015 is upon us 0 DATE upon festival 2015 upon us 5 True
6864 2016 perspectives from the festival of politics 0 DATE 2016 perspectives festival politics 4 True
21169 those we lost in 2011 1 DATE lose 2011 2 True
15398 the best tv shows of 2014 0 DATE best tv show 2014 4 True
In [ ]:
# Time Entity: Randome Samples
df[(df['contains_number']) & (df['sentence_length']<=5) & (df['entity']=='TIME')].sample(10)
Out[ ]:
headline is_sarcastic entity clean_headline sentence_length contains_number
21538 jcpenney abandons 45-second sale 1 TIME jcpenney abandon 45 second sale 5 True
13455 it's just 15 minutes to a grown-up, but not to kids 0 TIME 15 minutes grow not kid 5 True
26822 email from mom sent at 5:32 a.m. 1 TIME email mom send 5 32 5 True
12158 oven preheated for 16 seconds 1 TIME oven preheat 16 second 4 True
25836 23-hour suicide watch a failure 1 TIME 23 hour suicide watch failure 5 True
2240 donut shop gets weird after 11 a.m. 1 TIME donut shop get weird 11 5 True
3409 man turns vegetarian for 36 hours 1 TIME man turn vegetarian 36 hours 5 True
19067 the 1 minute blog. protesters and looting. 0 TIME 1 minute blog protesters loot 5 True
18283 5-minute hairstyles -- for real! 0 TIME 5 minute hairstyles real 4 True
2961 how to be nicer, healthier and more focused in 15 minutes 0 TIME nicer healthier focus 15 minutes 5 True
In [ ]:
# Cardinal Entity: Randome Samples
df[(df['contains_number']) & (df['sentence_length']<=5) & (df['entity']=='CARDINAL')].sample(10)
Out[ ]:
headline is_sarcastic entity clean_headline sentence_length contains_number
25261 senator's myspace top 8 all corporations 1 CARDINAL senator myspace top 8 corporations 5 True
8241 10 of illinois' safest cities 0 CARDINAL 10 illinois safest cities 4 True
7715 5 viruses that are scarier than ebola 0 CARDINAL 5 viruses scarier ebola 4 True
9351 7 strategies for lasting fat loss 0 CARDINAL 7 strategies last fat loss 5 True
6783 347 locals identify slain prostitute 1 CARDINAL 347 locals identify slay prostitute 5 True
13517 9 things smart people won't do 0 CARDINAL 9 things smart people 4 True
624 10 things not to do before your next race 0 CARDINAL 10 things not next race 5 True
25754 6 signs you're in a band-aid relationship (and what to do about it) 0 CARDINAL 6 sign band aid relationship 5 True
10224 5 lessons from chibok 0 CARDINAL 5 lessons chibok 3 True
20441 my beautiful reward and the 7 lessons it has taught me 0 CARDINAL beautiful reward 7 lessons teach 5 True

Inference from NER:

  • For some headlines, its important to retain the date, time and cardinal information
  • Special tokenization can be considered to retain the meaning of these numbers
  • Vocab size can be reduced further by removing these numbers
  • More research is required to improve the quality of vectorization and modeling performance
In [ ]:
# Wordcloud for text that is Not Sarcastic (LABEL - 0)
plt.figure(figsize = (20,20)) # Text that is Not Sarcastic
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(df[df.is_sarcastic == 0].headline))
plt.imshow(wc , interpolation = 'bilinear')
Out[ ]:
<matplotlib.image.AxesImage at 0x7fbc241730a0>
In [ ]:
# Wordcloud for text that is Sarcastic (LABEL - 1)
plt.figure(figsize = (20,20)) # Text that is Sarcastic
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(df[df.is_sarcastic == 1].headline))
plt.imshow(wc , interpolation = 'bilinear')
Out[ ]:
<matplotlib.image.AxesImage at 0x7fbc22fd6e50>

4. Define parameters

5. Get indices for words

6. Create features and labels

7. Get vocabulary size

8. Create a weight matrix using GloVe embeddings

9. Define and compile a Bidirectional LSTM model.

Hint: Be analytical and experimental here in trying new approaches to design the best model.

10. Fit the model and check the validation accuracy

Considering the above 4, 5, 6, 7, 8, 9, 10 parts together in below code cells:

In [ ]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, Model

from tensorflow.keras import layers
from tensorflow.keras.layers import Embedding, Layer, Dense, Dropout, LayerNormalization, Input, GlobalAveragePooling1D
from tensorflow.keras.layers import LSTM, Bidirectional, SimpleRNN, GRU, Conv1D,  MultiHeadAttention, AveragePooling1D
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
In [ ]:
X = df['clean_headline']
y = df['is_sarcastic']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
In [ ]:
# Tokenization
# Splitting sentences into words
# Finding the vocab size
# Important Parameters to consider
max_len = 20  
embedding_dim = 50     
oov_token = '00_V' 
padding_type = 'post'
trunc_type = 'post'  

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index) + 1
print("Vocab Size: ",vocab_size)
Vocab Size:  18276
In [ ]:
# Encoding of Inputs
# Converting the sentences to token followed by padded sequences in encoded format
# These are numeric encodings assigned to each word
train_sequences = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(train_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

test_sequences = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
In [ ]:
X_train[0]
Out[ ]:
array([ 813, 1144, 2021,  487,  294, 2272,   25,  333,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)
In [ ]:
y_train[0]
Out[ ]:
1
In [ ]:
# # Path of the data file
# path = 'glove.6B.zip'

# # Unzip files in the current directory

# with ZipFile (path,'r') as z:
#   z.extractall() 
# print("Training zip extraction done!")
In [ ]:
# Embedding matrix with 50 dimensions
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()
glove_file = open('glove.6B.50d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary [word] = vector_dimensions
glove_file.close()

vocab_size = len(tokenizer.word_index)+1

# Creating a embedding matrix for initial weights based on the precreated glove embedding

embedding_matrix = zeros((vocab_size, 50))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

ANN

In [ ]:
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length = max_len))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 20, 50)            913800    
                                                                 
 global_average_pooling1d (G  (None, 50)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 64)                3264      
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 917,129
Trainable params: 917,129
Non-trainable params: 0
_________________________________________________________________
In [ ]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test))
Epoch 1/10
157/157 [==============================] - 20s 91ms/step - loss: 0.6250 - accuracy: 0.6622 - val_loss: 0.5719 - val_accuracy: 0.7021
Epoch 2/10
157/157 [==============================] - 4s 24ms/step - loss: 0.5041 - accuracy: 0.7558 - val_loss: 0.4951 - val_accuracy: 0.7607
Epoch 3/10
157/157 [==============================] - 3s 16ms/step - loss: 0.4152 - accuracy: 0.8124 - val_loss: 0.4598 - val_accuracy: 0.7828
Epoch 4/10
157/157 [==============================] - 3s 19ms/step - loss: 0.3483 - accuracy: 0.8472 - val_loss: 0.4476 - val_accuracy: 0.7918
Epoch 5/10
157/157 [==============================] - 2s 12ms/step - loss: 0.2932 - accuracy: 0.8763 - val_loss: 0.4550 - val_accuracy: 0.7955
Epoch 6/10
157/157 [==============================] - 2s 10ms/step - loss: 0.2498 - accuracy: 0.8985 - val_loss: 0.4646 - val_accuracy: 0.7966
Epoch 7/10
157/157 [==============================] - 1s 8ms/step - loss: 0.2125 - accuracy: 0.9157 - val_loss: 0.4949 - val_accuracy: 0.7949
Epoch 8/10
157/157 [==============================] - 1s 8ms/step - loss: 0.1830 - accuracy: 0.9296 - val_loss: 0.5057 - val_accuracy: 0.7959
Epoch 9/10
157/157 [==============================] - 2s 10ms/step - loss: 0.1584 - accuracy: 0.9409 - val_loss: 0.5348 - val_accuracy: 0.7956
Epoch 10/10
157/157 [==============================] - 1s 9ms/step - loss: 0.1362 - accuracy: 0.9515 - val_loss: 0.5744 - val_accuracy: 0.7940
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
269/269 [==============================] - 1s 2ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
626/626 [==============================] - 2s 3ms/step - loss: 0.1098 - accuracy: 0.9650
Loss and Accuracy on Training data: [0.1098141148686409, 0.9650059938430786]
269/269 [==============================] - 1s 3ms/step - loss: 0.5744 - accuracy: 0.7940
Loss and Accuracy on Test data: [0.5743972659111023, 0.7939669489860535]

Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.83      0.81      4514
           1       0.80      0.76      0.78      4072

    accuracy                           0.79      8586
   macro avg       0.79      0.79      0.79      8586
weighted avg       0.79      0.79      0.79      8586

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

base_1 = []
base_1.append(['ANN', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
626/626 [==============================] - 2s 3ms/step - loss: 0.1098 - accuracy: 0.9650
269/269 [==============================] - 1s 2ms/step - loss: 0.5744 - accuracy: 0.7940

RNN

In [ ]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length = max_len))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
# model.add(Bidirectional(SimpleRNN(64, dropout = 0.5)))
model.add((SimpleRNN(64, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
157/157 [==============================] - 21s 87ms/step - loss: 0.6261 - accuracy: 0.6423 - val_loss: 0.5682 - val_accuracy: 0.7052
Epoch 2/10
157/157 [==============================] - 7s 45ms/step - loss: 0.5309 - accuracy: 0.7375 - val_loss: 0.5032 - val_accuracy: 0.7568
Epoch 3/10
157/157 [==============================] - 4s 27ms/step - loss: 0.4556 - accuracy: 0.7898 - val_loss: 0.4620 - val_accuracy: 0.7853
Epoch 4/10
157/157 [==============================] - 5s 35ms/step - loss: 0.3903 - accuracy: 0.8278 - val_loss: 0.4557 - val_accuracy: 0.7983
Epoch 5/10
157/157 [==============================] - 4s 24ms/step - loss: 0.3398 - accuracy: 0.8560 - val_loss: 0.4444 - val_accuracy: 0.8027
Epoch 6/10
157/157 [==============================] - 4s 26ms/step - loss: 0.2925 - accuracy: 0.8779 - val_loss: 0.4462 - val_accuracy: 0.8058
Epoch 7/10
157/157 [==============================] - 4s 27ms/step - loss: 0.2441 - accuracy: 0.9043 - val_loss: 0.4407 - val_accuracy: 0.8053
Epoch 8/10
157/157 [==============================] - 5s 32ms/step - loss: 0.2094 - accuracy: 0.9190 - val_loss: 0.4639 - val_accuracy: 0.8090
Epoch 9/10
157/157 [==============================] - 4s 22ms/step - loss: 0.1755 - accuracy: 0.9353 - val_loss: 0.5090 - val_accuracy: 0.8070
Epoch 10/10
157/157 [==============================] - 3s 22ms/step - loss: 0.1496 - accuracy: 0.9453 - val_loss: 0.5896 - val_accuracy: 0.8078
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
269/269 [==============================] - 1s 3ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
626/626 [==============================] - 3s 4ms/step - loss: 0.0825 - accuracy: 0.9732
Loss and Accuracy on Training data: [0.08250616490840912, 0.9731928706169128]
269/269 [==============================] - 2s 6ms/step - loss: 0.5896 - accuracy: 0.8078
Loss and Accuracy on Test data: [0.5896381735801697, 0.8078266978263855]

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.83      0.82      4514
           1       0.81      0.78      0.79      4072

    accuracy                           0.81      8586
   macro avg       0.81      0.81      0.81      8586
weighted avg       0.81      0.81      0.81      8586

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['RNN', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
626/626 [==============================] - 2s 4ms/step - loss: 0.0825 - accuracy: 0.9732
269/269 [==============================] - 1s 4ms/step - loss: 0.5896 - accuracy: 0.8078

GRU

In [ ]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length = max_len))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
# model.add(Bidirectional(GRU(32, dropout = 0.5)))
model.add((GRU(64, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
157/157 [==============================] - 14s 70ms/step - loss: 0.6116 - accuracy: 0.6621 - val_loss: 0.5454 - val_accuracy: 0.7234
Epoch 2/10
157/157 [==============================] - 4s 24ms/step - loss: 0.4973 - accuracy: 0.7589 - val_loss: 0.4767 - val_accuracy: 0.7697
Epoch 3/10
157/157 [==============================] - 3s 19ms/step - loss: 0.4163 - accuracy: 0.8123 - val_loss: 0.4491 - val_accuracy: 0.7906
Epoch 4/10
157/157 [==============================] - 3s 21ms/step - loss: 0.3506 - accuracy: 0.8471 - val_loss: 0.4262 - val_accuracy: 0.8018
Epoch 5/10
157/157 [==============================] - 1s 9ms/step - loss: 0.2955 - accuracy: 0.8751 - val_loss: 0.4531 - val_accuracy: 0.8050
Epoch 6/10
157/157 [==============================] - 2s 12ms/step - loss: 0.2492 - accuracy: 0.8970 - val_loss: 0.4345 - val_accuracy: 0.8072
Epoch 7/10
157/157 [==============================] - 1s 9ms/step - loss: 0.2054 - accuracy: 0.9193 - val_loss: 0.4830 - val_accuracy: 0.8020
Epoch 8/10
157/157 [==============================] - 2s 11ms/step - loss: 0.1706 - accuracy: 0.9346 - val_loss: 0.5227 - val_accuracy: 0.8074
Epoch 9/10
157/157 [==============================] - 1s 9ms/step - loss: 0.1399 - accuracy: 0.9461 - val_loss: 0.5839 - val_accuracy: 0.8034
Epoch 10/10
157/157 [==============================] - 2s 11ms/step - loss: 0.1196 - accuracy: 0.9539 - val_loss: 0.6076 - val_accuracy: 0.8012
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
269/269 [==============================] - 1s 2ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
626/626 [==============================] - 3s 4ms/step - loss: 0.0673 - accuracy: 0.9789
Loss and Accuracy on Training data: [0.06728716194629669, 0.9788838028907776]
269/269 [==============================] - 1s 4ms/step - loss: 0.6076 - accuracy: 0.8012
Loss and Accuracy on Test data: [0.6075868606567383, 0.8011879920959473]

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.86      0.82      4514
           1       0.83      0.73      0.78      4072

    accuracy                           0.80      8586
   macro avg       0.80      0.80      0.80      8586
weighted avg       0.80      0.80      0.80      8586

Confusion Matrix Chart:
In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['GRU', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
626/626 [==============================] - 2s 3ms/step - loss: 0.0673 - accuracy: 0.9789
269/269 [==============================] - 1s 3ms/step - loss: 0.6076 - accuracy: 0.8012

LSTM

In [ ]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length = max_len))
model.add(Conv1D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(AveragePooling1D(pool_size = 2))
model.add(Bidirectional(LSTM(64, dropout = 0.5)))
# model.add((LSTM(32, dropout = 0.5)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

H = model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_test, y_test), verbose=1)
Epoch 1/10
157/157 [==============================] - 16s 74ms/step - loss: 0.5992 - accuracy: 0.6712 - val_loss: 0.5304 - val_accuracy: 0.7345
Epoch 2/10
157/157 [==============================] - 4s 28ms/step - loss: 0.4799 - accuracy: 0.7677 - val_loss: 0.4624 - val_accuracy: 0.7807
Epoch 3/10
157/157 [==============================] - 4s 27ms/step - loss: 0.3876 - accuracy: 0.8238 - val_loss: 0.4312 - val_accuracy: 0.8000
Epoch 4/10
157/157 [==============================] - 3s 17ms/step - loss: 0.3139 - accuracy: 0.8660 - val_loss: 0.4350 - val_accuracy: 0.8119
Epoch 5/10
157/157 [==============================] - 2s 13ms/step - loss: 0.2484 - accuracy: 0.8985 - val_loss: 0.4459 - val_accuracy: 0.8116
Epoch 6/10
157/157 [==============================] - 2s 15ms/step - loss: 0.1965 - accuracy: 0.9216 - val_loss: 0.4713 - val_accuracy: 0.8152
Epoch 7/10
157/157 [==============================] - 2s 13ms/step - loss: 0.1566 - accuracy: 0.9393 - val_loss: 0.5337 - val_accuracy: 0.8109
Epoch 8/10
157/157 [==============================] - 3s 17ms/step - loss: 0.1268 - accuracy: 0.9512 - val_loss: 0.5809 - val_accuracy: 0.8097
Epoch 9/10
157/157 [==============================] - 2s 15ms/step - loss: 0.1005 - accuracy: 0.9620 - val_loss: 0.6403 - val_accuracy: 0.8081
Epoch 10/10
157/157 [==============================] - 2s 14ms/step - loss: 0.0836 - accuracy: 0.9676 - val_loss: 0.7254 - val_accuracy: 0.8024
In [ ]:
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.plot(H.history['accuracy'], label = 'Train')
plt.plot(H.history['val_accuracy'], label = 'Validation')
plt.legend()
plt.title('Accuracy')

plt.subplot(1,2,2)
plt.plot(H.history['loss'], label = 'Train')
plt.plot(H.history['val_loss'], label = 'Validation')
plt.legend()
plt.title('Loss')

plt.show()
In [ ]:
y_pred_proba = model.predict(X_test)
y_pred = np.array([0 if proba < 0.5 else 1 for proba in y_pred_proba])
269/269 [==============================] - 2s 3ms/step
In [ ]:
# Classification Accuracy
print("Classification Accuracy:")
print('Loss and Accuracy on Training data:',model.evaluate(X_train, y_train))
print('Loss and Accuracy on Test data:',model.evaluate(X_test, y_test))
print()

# Classification Report
print("Classification Report:\n",classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix Chart:")
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ['0', '1']],  
                         columns = [i for i in ['0', '1']])
plt.figure(figsize = (12,10))
sns.heatmap(df_cm, annot=True, fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Classification Accuracy:
626/626 [==============================] - 3s 5ms/step - loss: 0.0391 - accuracy: 0.9889
Loss and Accuracy on Training data: [0.03913184627890587, 0.9889177083969116]
269/269 [==============================] - 2s 6ms/step - loss: 0.7254 - accuracy: 0.8024
Loss and Accuracy on Test data: [0.7253589630126953, 0.8023526668548584]

Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.82      4514
           1       0.81      0.76      0.78      4072

    accuracy                           0.80      8586
   macro avg       0.80      0.80      0.80      8586
weighted avg       0.80      0.80      0.80      8586

Confusion Matrix Chart:

Model Comparison

In [ ]:
# Model comparison
precision = precision_score(y_test,y_pred, average='macro')
recall = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')

Train_Accuracy = model.evaluate(X_train, y_train)
Test_Accuracy = model.evaluate(X_test, y_test)

# base_1 = []
base_1.append(['LSTM', Train_Accuracy[1], Test_Accuracy[1], precision, recall, f1])
model_comparison = pd.DataFrame(base_1,columns=['Model','Train Accuracy','Test Accuracy','Precision','Recall','F1 Score'])
model_comparison.sort_values(by=['Recall','F1 Score'], inplace=True, ascending=False)
model_comparison
626/626 [==============================] - 2s 4ms/step - loss: 0.0391 - accuracy: 0.9889
269/269 [==============================] - 1s 4ms/step - loss: 0.7254 - accuracy: 0.8024
Out[ ]:
Model Train Accuracy Test Accuracy Precision Recall F1 Score
1 RNN 0.973193 0.807827 0.807780 0.806535 0.806970
3 LSTM 0.988918 0.802353 0.803536 0.800114 0.800925
2 GRU 0.978884 0.801188 0.804656 0.797936 0.799004
0 ANN 0.965006 0.793967 0.794422 0.792103 0.792739

BERT

In [ ]:
df1 = df[['clean_headline', 'is_sarcastic']]
df1.head()
Out[ ]:
clean_headline is_sarcastic
0 thirtysomething scientists unveil doomsday clock hair loss 1
1 dem rep totally nail congress fall short gender racial equality 0
2 eat veggies 9 deliciously different recipes 0
3 inclement weather prevent liar get work 1
4 mother come pretty close use word stream correctly 1
In [ ]:
# Split the data for training and testing
# To be used in the transformers (BERT)
train, test = train_test_split(df1, test_size=0.5, random_state=0)
In [ ]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
In [ ]:
# Install Transformers library
!pip install transformers
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 49.1 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 100.6 MB/s eta 0:00:00
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (6.0)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers) (4.64.1)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 25.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.21.6)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (23.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers) (3.9.0)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers) (2.25.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (2022.6.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.5.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (4.0.0)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
In [ ]:
# Load the BERT Classifier and Tokenizer alıng with Input modules
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [ ]:
# We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for classification task:
model.summary()
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
In [ ]:
# We have two pandas Dataframe objects waiting for us to convert them into suitable objects for the BERT model. 
# We will take advantage of the InputExample function that helps us to create sequences from our dataset. 
# The InputExample function can be called as follows:
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)
Out[ ]:
InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

Now we will create two main functions:

  1. convert_data_to_examples: This will accept our train and test datasets and convert each row into an InputExample object.
  2. convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.
In [ ]:
def convert_data_to_examples(train, test, clean_headline, is_sarcastic): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[clean_headline], 
                                                          text_b = None,
                                                          label = x[is_sarcastic]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[clean_headline], 
                                                          text_b = None,
                                                          label = x[is_sarcastic]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

  train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                           test, 
                                                                           'clean_headline', 
                                                                           'is_sarcastic')
  
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True, # pads to the right by default # CHECK THIS for pad_to_max_length
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


clean_headline = 'clean_headline'
is_sarcastic = 'is_sarcastic'
In [ ]:
# Our dataset containing processed input sequences are ready to be fed to the model.
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, clean_headline, is_sarcastic)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)
In [ ]:
# We will use Adam as our optimizer, CategoricalCrossentropy as our loss function, and SparseCategoricalAccuracy as our accuracy metric. 
# Fine-tuning the model for 2 epochs will give us around 90% accuracy, which is great.

# Training the model might take a while, so ensure you enabled the GPU acceleration from the Notebook Settings.

%%time

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

H = model.fit(train_data, epochs=2, validation_data=validation_data)

# 30 min for maxlen = 128
Epoch 1/2
896/896 [==============================] - 922s 941ms/step - loss: 0.3645 - accuracy: 0.8313 - val_loss: 0.3927 - val_accuracy: 0.8394
Epoch 2/2
896/896 [==============================] - 834s 931ms/step - loss: 0.0924 - accuracy: 0.9670 - val_loss: 0.4919 - val_accuracy: 0.8498
CPU times: user 15min 1s, sys: 5min 11s, total: 20min 13s
Wall time: 29min 16s

6. Use the designed model to print the prediction on any one sample.

In [ ]:
# Making Predictions
# I created a list of two reviews I created. The first one is a sarcastic review, while the second one is cnot sarcastic.
pred_sentences = ['What planet did you come from?',
                  'This is really a very beautiful pic']
In [ ]:
# We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model
# and run a final softmax layer to get the predictions. We can then use the argmax function to determine whether our sentiment 
# prediction for the review is positive or negative. Finally, we will print out the results with a simple for loop. 
# The following lines do all of these said operations:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['0','1']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])
What planet did you come from? : 
 0
This is really a very beautiful pic : 
 0
In [ ]:
# Using the BERT on 5 test samples
predict_set = test[0:5]
pred_sentences = list(predict_set['clean_headline'])

tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['0','1']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])
exasperate huckabee sanders remind press corps children 14 feel pain : 
 1
tampon ads honest : 
 0
moviegoer manage sneak candy past teenage usher earn 7 hour : 
 1
noaa predict see hurricanes year 2015 : 
 0
new lawn care product make neighbor lawn less green : 
 1

Conclusion:

  • In this notebook, We used text preprocessing to prepare the data and make it compatible for vaious ML/DL models alongwith the required EDA.
  • Compared performances of various models like ANN, RNN, GRU, LSTM and BERT.
  • Using hyperparameter tuning, we can further improve the performance of various models.
  • Text cleaning by considering the NER aspects can further improve the model performance
  • More advanced transformers can also be tried in this project